Proteomics, Post-translational Modifications, andIntegrative Analyses Reveal MolecularHeterogeneity within Medulloblastoma Subgroups¶

PROTEOMICS DATASET RE-ANALYSIS¶

Abstract¶

There is a pressing need to identify therapeutic targets in tumors with low mutation rates such as the malignant pediatric brain tumor medulloblastoma. To address this challenge, we quantitatively profiled global proteomes and phospho-proteomes of 45 medulloblastoma samples. Integrated analyses revealed that tumors with similar RNA expression vary extensively at the post-transcriptional and post-translational levels. We identified distinct pathways associated with two subsets of SHH tumors, and found post-translational modifications of MYC that are associated with poor outcomes in Group 3 tumors. We found kinases associated with subtypes and showed that inhibiting PRKDC sensitizes MYC-driven cells to radiation. Our study shows that proteomics enables a more comprehensive, functional readout, providing a foundation for future therapeutic strategies.

|experimental\_design| Figure 1: Summary of data types included in this study, depth of proteomic data types, and cohort composition.

Consensus clustering, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE) carried out separately on each data type predominantly revealed the known subgroups: Group 3, Group 4, and SHH (Figures (Figures2A,2A, S1-S3). The proteomic data also identified very stable subsets of two known subgroups. Here, we refer to Group 3 clusters as Group 3a (G3a) and Group 3b (G3b), and SHH clusters as SHHa and SHHb.

subgroups Figure 2: Comparison of clustering results.

REF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6372116/

Objectives¶

Use CKG’s analytics core to:

Re-analyze the proteomics dataset:
- Preprocessing
- Stratification looking at covariates: Age, Gender, Metastatic status and histology
- Differential regulation including covariates
- Functional enrichment
Identify patients subgroups
- Integration of all omics datasets using Similarity Network Fusion (REF: https://www.nature.com/articles/nmeth.2810).
- Idenfity clinical and molecular differences between the identifierd subgroups
Investigate correlating PTMs on relevant protein complexes

Loading the Data¶

In this analysis we use:

Clinical metadata
RNA-sequencing
Proteomics
Phosphoproteomics
Phosphotyrosine proteomics
Acetylomics

This datasets have been previously formated (samples x proteins) and standardized to UniProt identifiers (data available in /assets) (see notebook Meduloblastoma Data Processing.ipynb).

[2]:

import pandas as pd
import numpy as np

from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from ckg.graphdb_builder import mapping
from ckg.graphdb_connector import connector

from plotly.offline import init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)

C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package outdated is out of date. Your version is 0.2.0, the latest is 0.2.1.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.10, the latest is 0.3.11.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs

WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

[3]:

clinical_data = pd.read_excel('../../assets/samples.xlsx', header=0)
rnaseq_data = pd.read_csv("../../assets/rnaseq.tsv", sep='\t', header=0)
proteomics_data = pd.read_csv("../../assets/proteomics.tsv", sep='\t', header=0)
phosphoSTY_data = pd.read_csv("../../assets/phosphoSTY.tsv", sep='\t', header=0)
phosphoY_data = pd.read_csv("../../assets/phosphoY.tsv", sep='\t', header=0)
acetylomics_data = pd.read_csv("../../assets/acetylomics.tsv", sep='\t', header=0)

1. Re-analyze the proteomics dataset¶

[4]:

proteomics_data.head()

[4]:

	index	Q6MZP0	Q8NF91	Q8WXH0	Q15149	Q15149.1	Q15149.2	Q15149.3	P58107	B4DTV0	...	A0A0A0MR82	Q9HD23	P78325	O43763	P28749	A0A024QZT7	Q96MC2	W5XKT8	Q5VZ18	group
0	MB018	1.068076	-0.696709	0.931591	2.420232	2.420232	2.436102	2.429754	1.690194	0.122202	...	NaN	0.620531	NaN	-2.797946	NaN	NaN	NaN	-0.290428	1.607669	GR3
1	MB095	-2.623296	1.433604	-2.755208	-0.515201	-0.522668	-0.515201	-0.522668	-0.398223	-0.983114	...	NaN	NaN	-2.658141	NaN	NaN	NaN	-1.707383	NaN	NaN	GR3
2	MB106	-0.923848	-1.281532	-1.003560	-1.279489	-1.304015	-1.304015	-1.301972	0.318850	-2.362762	...	NaN	-0.443529	NaN	0.339289	NaN	NaN	NaN	-0.504846	-0.686754	GR3
3	MB170	-2.355968	1.185584	-2.055772	0.250797	0.252697	0.241297	0.246997	0.294496	0.203297	...	NaN	NaN	-4.029845	NaN	NaN	NaN	-0.075999	NaN	NaN	GR3
4	MB226	0.077085	-3.470951	-0.535310	-1.351123	-1.353264	-1.359688	-1.359688	-1.571671	-2.201195	...	NaN	-0.668067	NaN	0.246243	1.181965	NaN	NaN	NaN	NaN	GR3

5 rows × 13127 columns

Preprocessing:¶

Remove WNT samples (low number of samples)
Filter proteins with more than 80% missing values in all groups
Normalize data using median-centered normalization
Impute missing values using a mixed model: K-Nearest Neighbors algorithm (KNN) if at least 60% valid values, otherwise Minimum Probability (MinProb) drawing random values from a down-shifted Gaussian distribution
Include Protein name in columns using CKG’s mapping functionality
Include relevant clinical features: Age, Gender, Metastatic status, histology

[5]:

proteomics_data = proteomics_data[proteomics_data['group'] != 'WNT']

[6]:

valid_proteins = analytics.extract_percentage_missing(proteomics_data, missing_max=0.2, drop_cols=['index'], group='group', how='all')
proteomics_data = proteomics_data[valid_proteins + ['index', 'group']]

[7]:

proteomics_data = analytics.normalize_data(proteomics_data, method='median', normalize='samples')

[8]:

proteomics_data = analytics.imputation_mixed_norm_KNN(proteomics_data, index_cols=['index','group']).reset_index()
proteomics_data = proteomics_data.rename(columns={'index':'sample'})

[9]:

m = mapping.getMappingFromDatabase(proteomics_data.columns, node="Protein", attribute_from='id', attribute_to='name')
columns = [m[c]+"~"+c if c in m and m[c] is not None else c for c in proteomics_data.columns]
proteomics_data.columns = columns

[10]:

proteomics_data.head()

[10]:

	sample	group	FLJ10292~A0A023T6R1	RBM8~A0A023T787	A0A024QYR3	TM9SF2~A0A024QYR8	A0A024QYS2	AP1S1~A0A024QYT6	EIF3S8~A0A024QYU9	RABL5~A0A024QYV3	...	PMF1-BGLAP~U3KQ54	CSNK1G1~U3KQB3	LSM4~U3KQK1	HEL-76~V9HW21	MPND~W4VSR2	YP_003024026	YP_003024028	YP_003024029	YP_003024030	YP_003024036
0	MB018	GR3	0.305789	0.121693	-0.306807	-1.420907	-1.992240	0.080430	-0.586125	-0.170322	...	1.010434	-0.398855	0.362923	2.222930	0.454971	0.169304	-3.687196	-1.779577	0.673982	1.286578
1	MB095	GR3	0.161486	0.004686	-0.117270	0.385487	0.106731	-0.577716	0.938022	0.397931	...	0.440243	1.082378	0.149042	-0.338782	1.171978	1.405934	0.320776	1.109756	-0.313893	1.475623
2	MB106	GR3	0.246131	0.722363	0.252263	0.248175	0.037652	1.270131	0.865436	0.188902	...	1.186331	-0.450842	-0.465150	-2.932151	-0.575521	0.571113	-0.064543	0.368766	-1.517764	1.302834
3	MB170	GR3	0.158128	0.141028	-1.245953	-1.109155	-1.055955	-1.152854	0.448824	-0.016670	...	0.942817	-0.839358	0.255027	1.818705	0.072629	-0.073669	-0.537262	-1.215553	0.564722	-0.225667
4	MB226	GR3	0.829814	0.617831	-0.223676	-0.799670	-0.551286	-0.803952	0.647809	-0.435659	...	1.281616	0.088945	-0.294337	0.016143	-0.225818	-0.465636	-0.119610	-1.510562	0.589995	-0.559851

5 rows × 11137 columns

[11]:

    proteomics_data = proteomics_data.set_index(['sample', 'group']).join(clinical_data.set_index(['sample', 'group'])).reset_index()
    proteomics_data = proteomics_data.drop('Proteome-based ', axis=1)
    proteomics_data.loc[proteomics_data['Metastatic  Status'].isna(), 'Metastatic  Status'] = 'M3'

[12]:

proteomics_data.head()

[12]:

	sample	group	FLJ10292~A0A023T6R1	RBM8~A0A023T787	A0A024QYR3	TM9SF2~A0A024QYR8	A0A024QYS2	AP1S1~A0A024QYT6	EIF3S8~A0A024QYU9	RABL5~A0A024QYV3	...	MPND~W4VSR2	YP_003024026	YP_003024028	YP_003024029	YP_003024030	YP_003024036	Age At Diagnosis (years)	Gender	Metastatic Status	Histology
0	MB018	GR3	0.305789	0.121693	-0.306807	-1.420907	-1.992240	0.080430	-0.586125	-0.170322	...	0.454971	0.169304	-3.687196	-1.779577	0.673982	1.286578	13.0	F	M0	LCA
1	MB095	GR3	0.161486	0.004686	-0.117270	0.385487	0.106731	-0.577716	0.938022	0.397931	...	1.171978	1.405934	0.320776	1.109756	-0.313893	1.475623	3.0	M	M0	classic
2	MB106	GR3	0.246131	0.722363	0.252263	0.248175	0.037652	1.270131	0.865436	0.188902	...	-0.575521	0.571113	-0.064543	0.368766	-1.517764	1.302834	3.0	M	M3	LCA
3	MB170	GR3	0.158128	0.141028	-1.245953	-1.109155	-1.055955	-1.152854	0.448824	-0.016670	...	0.072629	-0.073669	-0.537262	-1.215553	0.564722	-0.225667	16.0	M	M3	LCA
4	MB226	GR3	0.829814	0.617831	-0.223676	-0.799670	-0.551286	-0.803952	0.647809	-0.435659	...	-0.225818	-0.465636	-0.119610	-1.510562	0.589995	-0.559851	5.0	M	M3	classic

5 rows × 11141 columns

Sample Stratification¶

We use Principal Component Analysis to visualize whether the groups cluster together.

[13]:

pca, args = analytics.run_pca(proteomics_data, drop_cols=['Gender', 'Metastatic  Status', 'Age At Diagnosis (years)', 'Histology'], group='group', annotation_cols=['sample'], components=2)

[14]:

args.update({'title':"PCA Medulloblastoma groups", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_groups", plot_format='png', directory='.')

We can also color based on possible covariates: Age, Gender, Metastatic status and histology

[15]:

pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group', 'Metastatic  Status', 'group', 'Gender', 'Histology'], group='Age At Diagnosis (years)', annotation_cols=['sample'], components=2)

[16]:

args.update({'title':"PCA Medulloblastoma by Age", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_age", plot_format='png', directory='.')

[17]:

pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Age At Diagnosis (years)', 'group', 'Metastatic  Status', 'Histology'], group='Gender', annotation_cols=['sample'], components=2)

[ ]:

[18]:

pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group','Age At Diagnosis (years)' , 'Gender', 'Histology'], group='Metastatic  Status', annotation_cols=['sample'], components=2)

[19]:

args.update({'title':"PCA Medulloblastoma by Metastatic Status", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_metastatics", plot_format='png', directory='.')

[20]:

pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group', 'Metastatic  Status', 'group', 'Gender', 'Age At Diagnosis (years)'], group='Histology', annotation_cols=['sample'], components=2)

[21]:

args.update({'title':"PCA Medulloblastoma by Histology", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_histology", plot_format='png', directory='.')

We can also stratify looking at biological processes (GO terms) common to the studied cohort. We use single-sample GSEA (REF:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2783335/) to project each sample onto a 2D space of normalized gene set enrichment scores.

We extract the Gene Ontology annotation from CKG’s graph database using the connector module.

[22]:

go_terms_query = "MATCH (p:Protein)-[]-(bp:Biological_process) WHERE (p.name+'~'+p.id) IN {} RETURN DISTINCT (p.name+'~'+p.id) AS identifier,bp.name AS annotation"
go_terms_query = go_terms_query.format(proteomics_data.columns.tolist())
annotation = connector.run_query(go_terms_query)

[23]:

annotation.head()

[23]:

	annotation	identifier
0	mitochondrial genome maintenance	AKT3~Q9Y243
1	mitochondrial genome maintenance	MGME1~Q9BQP7
2	mitochondrial genome maintenance	SLC25A33~Q9BSK2
3	mitochondrial genome maintenance	LONP1~P36776
4	mitochondrial genome maintenance	OPA1~O60313

[24]:

ssgsea_data = analytics.run_ssgsea(data=proteomics_data.drop(['Metastatic  Status', 'Gender', 'Histology', 'Age At Diagnosis (years)'], axis=1), annotation=annotation, annotation_col='annotation',
                     identifier_col='identifier', set_index=['group', 'sample'], outdir=None,
                     min_size=10, scale=False, permutations=0)

[25]:

pca, args = analytics.run_pca(ssgsea_data['nes'], drop_cols=[ ], group='group', annotation_cols=['sample'], components=2)

[26]:

args.update({'title':"Functional PCA Medulloblastoma groups", 'loadings':25, 'factor': 15})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="Functional_PCA_Medulloblastoma_groups", plot_format='png', directory='.')

Differential Regulation¶

We perform ANCOVA analysis including covariates: Age, Gender, Metastatic status and Histology.

[27]:

pro_anova_result = analytics.run_ancova(proteomics_data, covariates=['Age At Diagnosis (years)', 'Gender', 'Histology', 'Metastatic  Status'], drop_cols=['sample'], subject=None, permutations=0)

[28]:

pro_anova_result.shape

[28]:

(33405, 23)

[29]:

anova_significant = pro_anova_result[pro_anova_result.rejected & (pro_anova_result['posthoc padj'] <= 0.01)]

[30]:

anova_significant['identifier'].unique().shape

[30]:

(2590,)

[31]:

volcano_plot = viz.run_volcano(pro_anova_result, identifier='proteomics_volcanos',
            args={'alpha':0.01, 'fc':2, 'num_annotations': 80,
                  'colorscale':'Blues', 'showscale': False,
                  'marker_size':8, 'x_title':'log2FC', 'y_title':'-log10(pvalue)'})
i = 0
for plot in volcano_plot:
    iplot(plot.figure)
    viz.save_DASH_plot(plot, name="volcano_plot_"+str(i), plot_format='png', directory='.')
    i += 1

Functional Enrichment Analysis¶

[32]:

enrichment_results = analytics.run_up_down_regulation_enrichment(pro_anova_result, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)

[ ]:

[33]:

figures = viz.get_enrichment_plots(enrichment_results, identifier='enrichment', args={'width':2200})
i = 0
for fig in figures:
    iplot(fig.figure)
    viz.save_DASH_plot(fig, name="enrichment_"+str(i), plot_format='png', directory='.')
    i += 1

References¶

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6372116/